NSF PAR Search | NSF Public Access Repository

Don’t stop me now: Embedding based scheduling for LLMs

Shahout, Rana; Malach, Eran; Liu, Chunwei; Jiang, Weifan; Yu, Minlan; Mitzenmacher, Michael (April 2025, ICLR)

Free, publicly-accessible full text available April 24, 2026
The Evolution of Statistical Induction Heads: In-Context Learning Markov Chains

Edelman, Benjamin L; Edelman, Ezra; Goel, Surbhi; Malach, Eran; Tsilivis, Nikolaos (December 2024, 38th Conference on Neural Information Processing Systems (NeurIPS 2024))

Large language models have the ability to generate text that mimics patterns in their inputs. We introduce a simple Markov Chain sequence modeling task in order to study how this in-context learning (ICL) capability emerges. In our setting, each example is sampled from a Markov chain drawn from a prior distribution over Markov chains. Transformers trained on this task form statistical induction heads which compute accurate next-token probabilities given the bigram statistics of the context. During the course of training, models pass through multiple phases: after an initial stage in which predictions are uniform, they learn to sub-optimally predict using in-context single-token statistics (unigrams); then, there is a rapid phase transition to the correct in-context bigram solution. We conduct an empirical and theoretical investigation of this multi-phase process, showing how successful learning results from the interaction between the transformer's layers, and uncovering evidence that the presence of the simpler unigram solution may delay formation of the final bigram solution. We examine how learning is affected by varying the prior distribution over Markov chains, and consider the generalization of our in-context learning of Markov chains (ICL-MC) task to n-grams for n > 2.
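The difference between the in-context unigram and bigram predictors mentioned in this abstract can be illustrated with a minimal Python sketch. It is not the authors' code: the function names, the example transition matrix, and the add-alpha smoothing are all assumptions made for illustration, and the abstract does not specify the prior over chains or any estimator details.

```python
import random
from collections import Counter


def sample_chain(transition, length, rng):
    """Sample a token sequence from a Markov chain.

    `transition` is a row-stochastic matrix given as a list of lists
    (a hypothetical example chain, not one from the paper).
    """
    k = len(transition)
    state = rng.randrange(k)
    seq = [state]
    for _ in range(length - 1):
        state = rng.choices(range(k), weights=transition[state])[0]
        seq.append(state)
    return seq


def unigram_predict(context, k, alpha=1.0):
    """Next-token probabilities from single-token counts in the context.

    Add-alpha smoothing is an assumption of this sketch.
    """
    counts = Counter(context)
    total = len(context) + alpha * k
    return [(counts[t] + alpha) / total for t in range(k)]


def bigram_predict(context, k, alpha=1.0):
    """Next-token probabilities from in-context bigram counts.

    Only transitions that start from the last token of the context are counted.
    """
    last = context[-1]
    counts = Counter(b for a, b in zip(context, context[1:]) if a == last)
    total = sum(counts.values()) + alpha * k
    return [(counts[t] + alpha) / total for t in range(k)]


if __name__ == "__main__":
    rng = random.Random(0)
    # Hypothetical 2-state chain that almost always alternates states.
    transition = [[0.1, 0.9], [0.9, 0.1]]
    context = sample_chain(transition, 200, rng)
    print("unigram:", unigram_predict(context, 2))
    print("bigram: ", bigram_predict(context, 2))
```

On a chain like the alternating one above, the unigram predictor stays close to uniform while the bigram predictor tracks the row of the transition matrix selected by the last token, which is the gap between the two solutions the abstract describes.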
Full Text Available
On the Power of Decision Trees in Auto-Regressive Language Modeling

Gan, Yulu; Galanti, Tomer; Poggio, Tomaso; Malach, Eran (September 2024, Center for Brains, Minds and Machines (CBMM))

Originally proposed for handling time series data, Auto-regressive Decision Trees (ARDTs) have not yet been explored for language modeling. This paper delves into both the theoretical and practical applications of ARDTs in this new context. We theoretically demonstrate that ARDTs can compute complex functions, such as simulating automata, Turing machines, and sparse circuits, by leveraging "chain-of-thought" computations. Our analysis provides bounds on the size, depth, and computational efficiency of ARDTs, highlighting their surprising computational power. Empirically, we train ARDTs on simple language generation tasks, showing that they can learn to generate coherent and grammatically correct text on par with a smaller Transformer model. Additionally, we show that ARDTs can be used on top of transformer representations to solve complex reasoning tasks. This research reveals the unique computational abilities of ARDTs, aiming to broaden the architectural diversity in language model development.
Full Text Available
Is ML-Based Cryptanalysis Inherently Limited? Simulating Cryptographic Adversaries via Gradient-Based Methods

Shafran, Avital; Malach, Eran; Ristenpart, Thomas; Segev, Gil; Tessaro, Stefano (August 2024, CRYPTO 2024)

Full Text Available
Quantifying the Benefit of Using Differentiable Learning over Tangent Kernels

Malach, Eran; Kamath, Pritish; Abbe, Emmanuel; Srebro, Nathan (July 2021, Proceedings of Machine Learning Research)
We study the relative power of learning with gradient descent on differentiable models, such as neural networks, versus using the corresponding tangent kernels. We show that under certain conditions, gradient descent achieves small error only if a related tangent kernel method achieves a non-trivial advantage over random guessing (a.k.a. weak learning), though this advantage might be very small even when gradient descent can achieve arbitrarily high accuracy. Complementing this, we show that without these conditions, gradient descent can in fact learn with small error even when no kernel method, in particular using the tangent kernel, can achieve a non-trivial advantage over random guessing.
Full Text Available